A Bayesian model for identifying cancer subtypes from paired methylation profiles

Abstract Aberrant DNA methylation is the most common molecular lesion that is crucial for the occurrence and development of cancer, but has thus far been underappreciated as a clinical tool for cancer classification, diagnosis or as a guide for therapeutic decisions. Partly, this has been due to a lack of proven algorithms that can use methylation data to stratify patients into clinically relevant risk groups and subtypes that are of prognostic importance. Here, we proposed a novel Bayesian model to capture the methylation signatures of different subtypes from paired normal and tumor methylation array data. Application of our model to synthetic and empirical data showed high clustering accuracy, and was able to identify the possible epigenetic cause of a cancer subtype.


Introduction
Cancers are highly complex and heterogeneous diseases. According to certain characteristics, including different histologies, molecular profiles and specific mutations, cancers can be grouped into different subtypes, which play a crucial role in providing proper treatment and prognosis [1][2][3]. For example, based on visual characteristics of tumor tissues under microscopes, a cancer can be divided into several clinically meaningful subtypes, under a procedure called a histological evaluation [4,5]. For another instance, it is a fundamental and validated procedure to analyze gene expression profiles to identify cancer subtypes, which could be utilized as predictive markers to design personalized treatments [6][7][8]. Since genetic and epigenetic alternations are commonly known causal factors of cancers [9][10][11][12], grouping cancers according to different DNA methylation aberration patterns could promote the understanding of the critical role of epigenetic mechanisms in cellular processes and improve the effectiveness of cancer detection, diagnosis and prognosis [13][14][15].
Specifically, differential methylation analysis has shown that alterations in DNA methylation are closely associated with the occurrence and development of tumors [16][17][18][19]. Hypermethylation can result in gene silencing by repressing transcription of the promoter regions of tumor suppressor genes [20], whereas global hypomethylation can increase genomic and chromosomal instability [21,22], both leading to cancer development. As some CpG islands are only methylated in specific tumor types, analysis of cancer type-specific differentially methylated regions revealed stochastic methylation variation that can be identified as signatures to distinguish different types of cancers [23]. Moreover, analysis of the distinct methylation patterns may distinguish different histological subtypes of lung cancers [24]. However, it is difficult to predict the clinical progression by current prognostic factors and treatments. But DNA methylation profiles can be used to identify tumor subtypes and provide a better understanding of individual tumor biology, which will help clinicians pick personalize patient treatment strategies [25]. All these studies suggest that the subtypes of cancers can be identified by the unique DNA methylation signature.
Recently, particularly promoted by high-throughput measurement technologies, a variety of computing methods have been applied to cluster cancers to subtypes. We will review some of the most related works. Auwera et al. [25] developed a statistical method to investigate some specific DNA methylation patterns, which was applied to distinguish subtypes of breast cancers. Rhee et al. [26] introduced an integrated approach combining DNA methylation and gene expression data to analyze breast cancers, where the results showed that methylation status may contribute to the inference of cancer subtypes. By analyzing the cancer cell-intrinsic signals from bulk tumor transcriptomic data, the computational deconvolution method DeClust stratifies patients into subtypes and provides the tumor-type-specific stromal profiles [27]. Zhang et al. [28] proposed InfiniumClust by applying mixture of Gaussian distributions to cluster cancer subtypes with consideration of tumor purity. Siegmund et al. [29] introduced a Bernoulli-lognormal mixture model to cluster the cancers based on DNA methylation data obtained by the quantitative Methy-Light technology. Holm et al. [13] used K-means and hierarchical clustering to cluster cancers, and their results demonstrate that there are three groups of breast cancers with specific methylation profiles. Nonnegative matrix factorization method was applied to decompose DNA methylation profiles to categorize five consensus clusters, which were correlated with specific genetic abnormalities [30]. Furthermore, some methods integrated different types of data for clustering. For example, based on DNA methylation, gene expression, copy number, mutational and clinical data from pancreatic patients, Mishra and Guda [31] investigated the correlation between DNA methylation and differential gene expression profiles. For a more comprehensive review of the molecular subtyping methods, one can refer to Zhao et al. [32].
Different from subtyping methods which rely on predefined genes or regions, de novo subtyping methods automatically identify biomarkers for patient clustering in a data-driven fashion. Thus de novo methods are more useful in the scenario where the molecular mechanism is less clear, such as the effect of differential methylation. Some de novo subtyping methods are supervised learning method, which need a training dataset where the patients are well labeled to groups. Such supervised subtyping methods cannot help on complex diseases without enough labeling. A major problem of the existing de novo unsupervised subtyping methods is that the subtypes they infer may not be related to any disease mechanism, but purely correspond to different values of confounding variables such as age or race. To get rid of such confounding variables, more and more biomedical studies collected paired samples from tumor tissue and adjacent normal tissue, with the hope that the comparison between the paired samples can reveal how normal cells changed to cancer cells. A principled statistical method is needed to explicitly model such changes in order to efficiently detect the typical cancerization paths, where each path corresponds to a cancer subtype. In other words, traditional subtyping methods produce cancer status clusters, but we are interested in detecting aberrant-changes clusters.
In this paper, we propose a Bayesian mixture model for Subtyping (BaySub), which can provide the aberrant DNA methylation patterns of the cancer subtypes. Next, we describe our novel algorithm and demonstrate its performance in both synthetic and empirical data. The Bayesian estimation procedure for this model is provided in the Supplementary Information.

Data
We analyze paired DNA methylation array data from cancer patients. The original DNA methylation β-value is the ratio of the methylated probe intensity to the overall intensity, which ranges from 0 to 1. To better fit the Gaussian assumption of statistical modeling, M-value is more widely used to analyze differential methylation as it is a real number and approximately homoscedastic [33], which is defined as M = log 2 ( β 1−β ).

Model
The BaySub algorithm is illustrated in Figure 1. Suppose we have n pairs of M-value vectors for matched tumor and normal samples, , where x i and y i are both M-value vectors for m CpG sites. We assume the n normal samples are evolved from a reference normal cell with only non-diseasecausing changes at a small percent of CpG sites. As a CpG site is either methylated or unmethylated, a binary variable r j following Bernoulli distribution is used to model the methylation status of the reference normal cell at the j-th CpG site: for j = 1, 2, . . . , m, which means the variable r j takes the value 1 with methylation probability γ .
For every normal tissue, we assume its binary underlying methylation profile {z ij } is generated from the reference normal cell profile r j with random turning over probability 1 − δ: where '⊕' stands for the exclusive-or operator.
For a normal tissue to become a tumor tissue, we assume that there are s different paths from the corresponding normal tissue, each originated from a binary modification profile {w kj } and leading to a different subtype of the cancer of interest. More specifically, we assume the probability of turning over an original status in z ij is β, i.e.
The binary tumor methylation profile is assumed to be generated from the corresponding normal tissue profile with the perturbation at a subset of sites specified by the corresponding subtype signature: for i = 1, 2, . . . , n, j = 1, 2, . . . , m, k = 1, 2, . . . , s.
Due to the methylation measurement error or the cell heterogeneity of a tissue, we assume the observation M-value follows a normal distribution centered around a typical value of the corresponding methylation status.
For the i-th normal tissue at the j-th CpG site, the measured Mvalue x ij is assumed to following N(μ 0 , σ 0 ) if the corresponding z ij says it is unmethylated (i.e. z ij = 0), otherwise following N(μ 1 , σ 1 ): Similarly, we assume for the i-th tumor tissue at the j-th CpG site, the measured M-value y ij : for i = 1, 2, . . . , n, j = 1, 2, . . . , m.
In the above model, each row of the modification matrix W provides a unique path from a normal cell to a cancer cell, thus corresponding to one cancer subtype. If a cell in a row of W takes Figure 1. A f lowchart to illustrate the BaySub algorithm. The methylation status of the reference normal cell is represented by a binary vector R, and the rate of methylation is set to be γ . Based on this reference methylation status of normal tissue, a binary underlying methylation profile Z could be generated with mutation probability 1 − δ. Suppose there are s different paths W changing the methylation status from normal tissues to become tumor tissues, and the mutation rates is fixed at β. After obtaining the membership of the cancer subtypes λ, the tumor methylation profile can be generated from normal methylation profile Z, the membership λ and the mutation paths W. Therefore, the M-values of normal tissues follow normal distribution according to its methylation profile Z, and corresponding M-values of tumor tissues follow normal distribution according to its methylation profile φ. For individual patient, the corresponding λ i specifies its subtype membership, which will be estimated from its posterior distribution using a maximum a-posteriori (MAP) approach. For Bayesian inference, we introduced uniform priors and truncated uniform priors to ensure proper posterior purely determined by the data. Markov chain Monte Carlo (MCMC) algorithms are utilized to iteratively update all the parameters of our method until convergence. Detail distributions are given in the Supplementary Materials.

Synthetic data experiments
In this section, we evaluate our method BaySub on the synthetic datasets. To check the effect of different number of true subtypes, we perform three experiments following our model based on the setting in Table 1. For each experiment, we repeat for 20 times independently, i.e. generating 20 independent and identically distributed datasets and estimating separately, to evaluate both the mean performance for this experiment and the performance variation. For the Bayesian inference for each dataset, we run 10 independent chains of our MCMC algorithm. The number of iterations for each run of the MCMC algorithm is set to 200, which turns out to be enough for convergence and posterior sample collection. For convergence diagnosis, the key variables γ , μ 0 and μ 1 are selected to show the convergence of our method. One of 20 trials is randomly selected from each experiment. Figure 2 describes the changes of variable with the increasing of iterations. The results reveal that our method converges rapidly and stably. Besides, the Gelman-Rubin convergence diagnostic (R) is calculated to evaluate the convergence of MCMC. The values of variables selected from the last 60 iterations are divided into two sequences of length 30. It turns out that all theR values are smaller than 1.1, which indicates adequate convergence. Thus, both Figure 2 and Table 2 show rapid convergence of our algorithm.
We adopt two popular clustering performance measures, adjusted Rand index (ARI) and normalize mutual information (NMI), to evaluate the clustering performance of our method [34,35]. Moreover, we use two measures to quantify the inference accuracy of differential methylation. Based on the true value of aberrant methylation signaturesW, we calculate the inference accuracy of all elements in the variable W (denoted by AE), defined as Alternatively, we measure the site-wise inference accuracy by the percent of CpG sites whose status are totally correctly inferred Based on the above measurement, three synthetic datasets are utilized to evaluate the performance of BaySub method. As each experiment is repeated for 20 times independently, the mean performance and standard deviation are summarized in Table 3, which illustrates that our method obtains a high accuracy for both detecting cancer subtypes and identifying the methylation patterns.
For visualization, we plot the heatmap in Figure 3 to demonstrate the selected CpG sites, which can capture the specific methylation signatures associated with different cancer subtypes. For clarity, we only plot those differentially methylated CpG sites on which the value of corresponding elements in W should be 1 (i.e. differentially methylated between normal and tumor tissues). The colors in the left bar represent the true subtypes. From the figure, it is easy to see some block structures in heatmap, which indicates that these data include several methylation patterns.
Since each experiment is repeated for 20 times based on 20 independent datasets synthesized from a same parameter setting, after burn in, the estimates of all the variables in every trial are  Table 4 presents the variables estimated by our method. The results show that our estimating method preforms well on these simulation datasets.
In our method, the number of subtypes s should be given in advance, nevertheless it might be difficult to obtain the accurate value for real datasets. Here, Akaike information criterion (AIC) and Bayesian information criterion (BIC) are utilized to seek an appropriate model by compromising the goodness-of-fit and model complexity. We calculate the AIC and BIC values of the model with s values ranging from 2 to 6 on Experiment 2 and Experiment 3 datasets. The results are demonstrated in Figure 4, which shows that the minimum values corresponds with the true number of subtypes for both AIC and BIC. Hence, s can be readily learnt from the data.
One may still worry a wrongly specified number of subtypes may lead to meaningless results. To address this concern, we also test our method with a wrong number of cancer subtypes. We randomly select two datasets from Experiment 2 and Experiment 3 separately, and set the value of s to be the true value plus or minus 1. The experiments results are shown in Table 5. For example, when the s value is set to be 2 but the true value is 3, the first and third true subtypes are merged into one inferred group, which is marked by1. For another instance, when the s value is set to be 4 but the true value is 3, our algorithm may lead to two prediction results. In the first result, the inferred subtype3 is actually an empty class (thus purely dominated by the prior without support from the data), while the other three inferred subtypes correspond to the three true subtypes perfectly. In the second result, the third true subtype is divided into the two inferred subtype3 and4. All the results demonstrate that our method BaySub could deal with a slightly wrong number of cancer subtypes reasonably.

Subtypes Used s
Matching results As tumor heterogeneity affects the assignment of subtypes in the clinic, it is commonly concerned during data analysis that tumor tissues contain multiple cancer subtypes or are contaminated by normal cells during dissection. If every tumor contains more than one subtype, it would be more challenging for our method BaySub to deal with such a mixture of different subtypes and normal data.  different tumor tissues, k = 1, 2, . . . , s. Then a mixture dataset could be generated by integrating different subtypes and normal data by following steps: for the i-th paired mixture methylation vectors {x i , y i } n i=1 , x i ∈ R m , y i ∈ R m , first, two subtypes of cancers are randomly selected from all the subtypes (denoted by subtype A i and B i ), which are randomly selected from the subtypes {1, 2, . . . , s}. Second, the i-th paired methylation vectors is a linear combination of tumor tissues from these two selected subtypes and normal tissues, whose proportion is represented by a vector (P A , P B , P N ), i.e. {x i , The predicting target is set to be the major subtype, which is the largest proportion. Last, we repeat these two procedures to obtain the 'contaminated' datasets until the number of pairs equals 100. The results are shown in Table 6, which illustrate that our method could still achieve high accuracy without significant deterioration by these 'contaminated' data when one  Table 6. Performance of BaySub on 'contaminated' datasets. The true number of subtypes is reported in the first column. The proportions of (P A , P B , P N ) are shown in the second column. The biggest number is the proportion of major subtype, which is the predicting target. All the accuracies are displayed in the third to sixth columns subtype gains an outright majority, while the predicting accuracy would decrease with the gradual disappearance of this majority advantage.
In our model, we assume that the methylation value of every CpG site is independent and identically distributed (i.i.d.), and the distributions of methylation profiles for both tumor and normal tissues follow the Gaussian distribution. In this part, we will evaluate the performance of our method BaySub on the simulation datasets, which are generated from t distribution and the methylation status of the tumor and normal tissues are dependent. First, the vector R, representing the methylation status of the reference normal cell, is separated into 100 regions randomly. For every CpG site, the methylation status takes the value 1 with a probability of 0.7 or 0.3, and the probabilities for two adjacent regions are different. Second, for every row of modification matrix W, it is also separated into 100 regions randomly. The values within one region are the same. They take the value 1 with a probability of 0.2. Therefore, the elements in both vector R and matrix W are dependent, so the generated methylation profiles for tumor and normal tissues are dependent. The other parameters are the same as the values shown in Table 4. Although the BaySub algorithm is based on the assumption of i.i.d. and normal distribution, the experiment results illustrated in Table 7 verify that it could obtain high prediction accuracy and extract meaningful methylation patterns from the whole picture of cancer.

Empirical data experiments
In this section, we evaluate the performance of our method Bay-Sub on the real data of human cancer. We downloaded the methylation data from both the research on whole-genome sequencing in gastric cancer and TCGA database, which contain the methylation array data from both tumor and corresponding normal tissues [36]. All the DNA methylation profiles are measured by the same probe layout of 450K array, which is the most widely used platform for DNA methylation. Based on the research, gastric cancer is a heterogeneous disease, which can be separated into three different molecular subgroups: microsatellite instability (denoted by MSI), Epstein-Barr virus (denoted by EBV) and others. For TCGA database, based on the primary sites, these datasets are used to compose the second trial. Since these methylation profiles have some missing data, the common CpG sites shared by all datasets in the corresponding trial will be applied to infer the cancer subtypes. More details of every trial are shown in Table 8. The number of iterations is 200. The number of chains is set to 10. For the evaluation of the predicting accuracy, the true molecular subgroup of human cancer can be utilized to evaluate the estimated variable λ. To evaluate the achieved methylation signatures W, we centralized all the CpG sites for every cancer subtype to generate a standard valueW. First, we calculate the mean Mvalue vectors of every cancer subtype on all m CpG sites, denoted by {x k , y k }, where x k ∈ R m , y k ∈ R m , k = 1, 2, . . . , s. Then, based on the definition of M-value, 0 is selected as the segmentation criterion to divide all CpG sites for the every cancer subtype into hypermethylation and hypomethylation. For example, for the jth CpG site of k subtype, the average status of methylation for tumor is represented by c y k,j = 1, if y k,j > 0, otherwise c y k,j = 0, and it is the same with the average methylation status for normal data c x k,j . Therefore, the average status of methylation on all CpG sites for every subtype can be represented by {c x k,j , c y k,j }, where j = 1, 2, . . . , m, k = 1, 2, . . . , s. Finally, we can calculate the target vari-ableW = (w kj ) s×m , wherew kj = c x k,j ⊕ c y k,j . Based on the obtainedW, we can calculate both the accuracy of elements and the accuracy of CpG sites in variable W. The comparison results are shown in Table 9, which illustrates the good performance of our method BaySub on both predicting accuracy and identifying methylation patterns of different cancer subtypes. The posterior estimate of key parameters are listed in Table 10, including posterior mean and standard deviation.
Last, we also analyze the convergence of our method on the two real datasets. Both Figure 5 and Table 11 show that our method on real datasets converges rapidly.

Performance comparison
In this section, we compare the performance of BaySub with other popular clustering methods: K-means [37], K-medoids (PAM, CLARA) [38], Hierarchical Clustering (denoted by HC) [39] and Non-negative Matrix Factorization (denoted by NMF) [40]. As the clustering results of algorithms K-means and NMF are unstable due to different initial values, the average accuracy of 100 trials is reported as the comparison performance. The values of s for all methods are set to be 4. First, we compare the accuracy of  Table 4. To improve the diversity of methylation statuses for tumor and normal tissues, the mean and the variance for normal tissue (μ, σ ) and tumor tissue (α, η) are increased, respectively. The value of β is decreased from 0.2 to 0.05 to reduce the difference between cancer subtypes. Based on Table 12, these factors have little inf luence on our method BaySub, while the performances of other methods decline dramatically. Moreover, we also compare the performance of BaySub with other clustering methods on 'contaminated' datasets. We change the purity of tumor by reducing the dominance of major subtype, and alter the probability β turning over on the modification profile w. The comparison results in Table 13 display that the purity and the difference effect the performances of both BaySub and other clustering methods, but the predicting accuracy of BaySub method outperforms others. Last, we compare the performance of BaySub with other methods on empirical data, whose results are shown in Table 14.
In conclusion, based on Tables 12-14, our method BaySub obtains

Conclusions
In this paper, we propose a Bayesian mixture model named Bay-Sub for predicting the cancer subtypes based on paired methylation array data. Our method can capture the methylation signatures for different cancer subtypes. We evaluate the performance of our method on both synthetic and empirical data experiments.
In synthetic data experiments, our method achieve high predicting accuracy. Besides, the results of these experiments reveal that the proposed algorithm has good robustness and convergence. Moreover, our method can not only seek an appropriate number of subtypes by AIC and BIC values, but also can deal with the wrong specification of the number of subtypes. In real data experiments, the performance of our method is evaluated on two datasets, which are generated from both the gastric cancer project and TCGA database. The results illustrate the good performance of our method on real datasets, and the estimated parameters of the model converge rapidly and stably. Furthermore, in the section of performance comparison, BaySub algorithm provides better performance than some other clustering methods on both synthetic and empirical datasets. Note that BaySub can essentially deal with any disease where methylation changes play an important role, as long as each patient provides both a normal and a disease sample for methylation profiling. There are several directions to improve our method. First of all, we assume the methylation values of every CpG site are independent and identically distributed (i.i.d.) to reduce the complexity of model. Although simulation experiments show that if the simulated datasets are generated from heavy tail distribution and correlated methylation levels, BaySub based on the i.i.d. assumption could still extract meaningful information from the whole picture of simulated datasets, methylation statuses of neighboring sites seem positively correlated in reality instead of independent. Thus, future studies on releasing this i.i.d. assumption may contribute to improving the performance of cancer subtyping. Secondly, the current version of BaySub is designed for the methylation array data, but the binary methylation profiles used in BaySub essentially fits the single-cell methylation data well. One may modify the Gaussian observation model to adapt BaySub for single-cell methylation data analyses. Moreover, BaySub is currently designed for paired methylation data from tumors and adjacent normal tissues. In real tumor studies, not all patients can provide paired samples. Thus, it is important to extend BaySub by integrating unpaired and paired samples.

Key Points
• We propose a new Bayesian method named BaySub for clustering cancer patients based on paired methylation array data, which can simultaneously identify the methylation changes associated with different cancer subtypes. • Our method can not only seek an appropriate number of clusters through AIC and BIC, but also properly handle the wrong specification of the number of subtypes. • We evaluated the performance of our method based on both synthetic and empirical data, which showed that our method could achieve high clustering accuracy.

Sensitivity analysis
We conduct sensitivity analysis to investigate the inf luence of the truncated uniform priors. The truncation boundaries for μ c and α c is set to be (-50,50), (-100,100) and (-200,200), respectively, and that for σ 2 c and η 2 c is (0,100), (0,200) and (0,400). The experiment results illustrate the choices of truncation boundary have little effect on the performance of algorithm BaySub.

Comparison on computational cost
For Simulation 1 to Simulation 3 in this paper, both the K-means, K-medoids and Hierarchical Clustering methods only take about 1 min, while BaySub takes about 10 iterations (about 90 s) to converge and 200 iterations (about 28 min) to collect posterior sample according to Figure 5, although it is time-consuming to perform Bayesian inference by MCMC for large datasets and highdimensional models. Consider the datasets are off line and MCMC methods can provide the full picture of posterior distribution, it is acceptable.